R Laboratory – Exercise 1

American Airlines Employees

Reading Data

We begin by importing the necessary packages and setting ggplot-related stuff.

library(kableExtra)
library(tidyverse)

font <- "Roboto Condensed"
theme_set(theme_minimal(base_size = 14, base_family = font))
my_palette <- unname(palette.colors(4, "Okabe-Ito"))

Now we can start importing data. Let’s first put each company’s data in a separate tibble, and then merge them in a single one, adding a column with the company name. Also, we’ll combine the month and year columns into a single date column (labeled as month).

names <- c("month", "year", "full_time", "part_time", "total")
american <- read_table("data/american_airline_empl.txt",
                       skip = 1, col_names = names)
delta <- read_table("data/delta_airline_empl.txt",
                    skip = 1, col_names = names)
federal <- read_table("data/federal_express_empl.txt",
                      skip = 1, col_names = names)
united <- read_table("data/united_airline_empl.txt",
                      skip = 1, col_names = names)

# start with a named list to recover the names when binding rows
employees <- list(american = american, delta = delta,
                   federal = federal, united = united) |>
    bind_rows(.id = "company") |>
    mutate(month = make_date(year = year, month = month)) |>
    select(-year)
employees tibble
company month full_time part_time total
american 1990-01-01 68137 9039 77176
american 1990-02-01 68725 9273 77998
american 1990-03-01 69509 9376 78885
american 1990-04-01 69713 9326 79039
american 1990-05-01 70376 9309 79685
american 1990-06-01 71258 9369 80627

We’ll also keep track of properly-typeset version of the company and jobs names, to use them later in the plots.

full_names <- list(american = "American Airlines",
                   delta = "Delta Airlines",
                   federal = "Federal Express",
                   united = "United Airlines")
job_type <- list(full_time = "Full-time",
                 part_time = "Part-time",
                 total = "Total")

Full-Time vs Part-Time Employees

Here we’ll generate a plot of the number of employees in time, for all four companies. We’ll keep the full-time and part-time workers’ numbers separated, and also produce a plot with the total.

lapply(names(job_type), function(idx) {
    employees |>
    ggplot(aes(x = month, y = .data[[idx]], colour = company)) +
        geom_line(lwd = 0.8) +
        scale_colour_manual(values = my_palette, labels = full_names) +
        labs(x = "Year", y = "Number of employees", colour = "Company",
             title = paste(job_type[[idx]], "Employees"))
})
## [[1]]

## 
## [[2]]

## 
## [[3]]

To answer the question of when did each company reach the minimum and maximum number of employees, we can directly work on employees with the dplyr functions group_by and summarize.

minmax <- employees |>
    group_by(company) |>
    summarize(min = min(total), when_min = month[which.min(total)],
              max = max(total), when_max = month[which.max(total)])
company min when_min max when_max
american 62290 2013-09-01 109171 2018-06-01
delta 46410 2006-11-01 94675 2023-01-01
federal 84885 1990-01-01 270383 2021-03-01
united 45781 2011-06-01 102046 2001-03-01

Here we’ll plot the evolution in time of the fraction of part-time workers over the total workforce of each company.

employees |>
    mutate(part_time = 100 * part_time / total) |>
    ggplot(aes(x = month, y = part_time, colour = company)) +
        geom_line(lwd = 0.8) +
        scale_colour_manual(values = my_palette, labels = full_names) +
        labs(x = "Year", y = "Part-time employees (%)", colour = "Company")

COVID-19 Effect

To analyse the influence of the COVID-19 pandemic, we can zoom in on the period from 2019 to 2023 in the total workforce’s plot. We also add a reference line with the previously-found historic minima of the workforce.

employees |>
    dplyr::filter(year(month) > 2018) |>
    ggplot(aes(x = month, y = total, colour = company)) +
        geom_line(lwd = 0.8) +
        geom_hline(aes(yintercept = min, colour = company,
                       linetype = "Historic minimum"),
                   minmax, lwd = 0.6) +
        scale_linetype_manual(name = "", values = rep("dashed", 4)) +
        scale_colour_manual(values = my_palette, labels = full_names) +
        labs(x = "Year", y = "Number of employees", colour = "Company")

We see indeed that two out of four companies – Delta and United – took a deep hit from the pandemic, resulting in a depletion of the workforce almost to an all-time low, especially for Delta. American apparently felt the crisis a bit less, but we can still observe a dip before the start of 2021.

Federal Express is, on the other hand, in complete contrast with the other three companies: its workforce even grew in the pandemic period. This is perhaps not that surprising if we consider that FedEx is a giant corporation that focuses on basically every aspect of transportation, not simply passenger air travel like the other three companies.

New York City Flights

The data we want to analyse here comes from the nycflights13 package.

library(nycflights13)
my_palette <- rcartocolor::carto_pal(10, "Safe")[1:3]

First, let’s have a look at the four tibbles in the dataset.

airlines tibble
carrier name
9E Endeavor Air Inc. 
AA American Airlines Inc. 
AS Alaska Airlines Inc. 
B6 JetBlue Airways
DL Delta Air Lines Inc. 
EV ExpressJet Airlines Inc. 
airports tibble
faa name lat lon alt tz dst tzone
04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A America/New_York
06A Moton Field Municipal Airport 32.46057 -85.68003 264 -6 A America/Chicago
06C Schaumburg Regional 41.98934 -88.10124 801 -6 A America/Chicago
06N Randall Airport 41.43191 -74.39156 523 -5 A America/New_York
09J Jekyll Island Airport 31.07447 -81.42778 11 -5 A America/New_York
0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -5 A America/New_York
flights tibble
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
planes tibble
tailnum year type manufacturer model engines seats speed engine
N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan
N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan

Flights per Day

Now we’ll plot the total number of flights departing from each of the three NYC airports (JFK, LGA and EWR) as a function of time.

flights$date <- as_date(flights$time_hour, tz = "EST")

flights |>
    count(date, origin) |>
    ggplot(aes(x = date, y = n, colour = origin)) +
        geom_point(size = 0.8) +
        geom_smooth(aes(fill = origin), lwd = 0.8) +
        scale_colour_manual(values = my_palette) +
        scale_fill_manual(values = my_palette, guide = "none") +
        labs(x = "Date", y = "Number of flights", colour = "Airport") +
        guides(colour = guide_legend(override.aes = list(fill = NA)))

Another plot: this time, the average number of flights per day in a week, keeping working days and weekends separate, as a function of the week number in the year.

flights |>
    mutate(day_type = ifelse(wday(date) %in% c(1, 7),
                             "weekend", "workday")) |>
    # count by day, keeping info on week and day_type
    count(week = week(date), day, day_type) |>
    # regroup by week and day_type to get the averages we want
    group_by(week, day_type) |>
    summarize(avg = mean(n)) |>
    ggplot(aes(x = week, y = avg, colour = day_type)) +
        geom_line(lwd = 0.8) +
        scale_colour_manual(values = my_palette,
                            labels = c("Weekends", "Workdays")) +
        labs(x = "Week", y = "Average flights per day",
             colour = "Day type")

Looking at the workdays’ plot, we can notice a big dip around the 20th and 30th week, one next to the 50th and another in the last. The last dip is clearly caused by Christmas and New Year’s eve, while the other two line up perfectly with the US Independence Day on the 4th of July (27th week of the year) and Thanksgiving, which in 2013 was on the 28th of November (48th week of the year). Indeed:

Flights in the workweeks around July 4th
week n
25 4949
26 4943
27 4491
28 5012
29 4999
Flights in the workweeks around Thanksgiving
week n
46 4907
47 4891
48 4302
49 4832
50 4799

In the case of Thanksgiving in particular, the decrease in the workday average is coupled to a sharp increase in the flights over the weekend. Elsewhere, the weekends present some big dips in some not-so-easily explainable parts of the calendar, but this is perhaps simply a matter of less data – we’re averaging over two days instead of five – leading to larger fluctuations.

Departure Delays

Now, let’s examine the trend in the departure delays. We’ll calculate the minimum, maximum and average delay for each day of the year, keeping the three airports’ data separate.

dep_delays <- flights |>
    # some flights don’t have a recorded departure and/or arrival time
    drop_na(dep_delay) |>
    group_by(date, origin) |>
    summarize(avg = mean(dep_delay),
              min = min(dep_delay),
              max = max(dep_delay))

plot_delays <- function(key) {
    names <- c(
        "Maximum" = "max",
        "Minimum" = "min",
        "Average" = "avg"
    )

    ggplot(dep_delays, aes(x = date, y = .data[[names[key]]],
                           colour = origin)) +
        geom_point(size = 0.8) +
        geom_smooth(aes(fill = origin), lwd = 0.8) +
        scale_colour_manual(values = my_palette) +
        scale_fill_manual(values = my_palette, guide = "none") +
        labs(x = "Date", y = paste(key, "departure delay (min)"),
            colour = "Airport") +
        guides(colour = guide_legend(override.aes = list(fill = NA)))
}

Let’s examine the average delay first.

plot_delays("Average")

It’s difficult to identify any noteworthy trend – perhaps only a slight increase of the delays in the summer months, which is reasonable. Also, there’s not a significant difference between the three airports.

Now, let’s look at the maxima and minima.

plot_delays("Maximum")

plot_delays("Minimum")

Nothing too significant here too.

Plane Speed

Here we’ll estimate the average speed of planes taking flight in a given day. To do it, we exploit the distance and air_time variables in the database, and simply divide the first by the second – after converting to kilometers per hour, of course. For completeness, we add a ribbon of 2 standard deviations width around the mean.

flights |>
    drop_na(air_time) |>
    # distances are in miles, air_time is in minutes
    mutate(speed = distance * 1.60934 * 60 / air_time) |>
    group_by(date) |>
    summarize(avg = mean(speed), sd = sd(speed)) |>
    ggplot(aes(x = date, y = avg)) +
        geom_ribbon(aes(ymin = avg - sd, ymax = avg + sd),
                    fill = "#627bb4", alpha = 0.5,
                    colour = "transparent") +
        geom_line(colour = "#4a417b", lwd = 0.8) +
        labs(x = "Date", y = "Average speed during the day (km/h)")

Company Comparison

Lastly, we’ll look at the difference between the various airline companies (the carrier column in the dataset). For example, we can list the companies offering the two most flights per day…

flights |>
    count(date, carrier) |>
    slice_max(n = 2, by = date, order_by = n) |>
    head(10) |>
    kbl()
date carrier n
2013-01-01 UA 165
2013-01-01 B6 163
2013-01-02 UA 170
2013-01-02 B6 162
2013-01-03 B6 162
2013-01-03 UA 159
2013-01-04 B6 161
2013-01-04 UA 161
2013-01-05 B6 154
2013-01-05 UA 117

… and per week.

flights |>
    count(week = week(date), carrier) |>
    slice_max(n = 2, by = week, order_by = n) |>
    head(10) |>
    kbl()
week carrier n
1 B6 1107
1 UA 1067
2 UA 1034
2 B6 993
3 UA 1032
3 B6 968
4 UA 1032
4 B6 959
5 UA 1042
5 B6 965

Also, we can get the airline company offering the least flights per month.

flights |>
    count(month, carrier) |>
    slice_min(by = month, order_by = n) |>
    mutate(month = month.abb[month]) |>
    kbl()
month carrier n
Jan OO 1
Feb HA 28
Mar YV 18
Apr HA 30
May HA 31
Jun OO 2
Jul HA 31
Aug OO 4
Sep OO 20
Oct HA 21
Nov OO 5
Dec HA 28

As a last example, we can show which airline company offered the longest distance flight in each month.

flights |>
    group_by(month) |>
    summarize(carrier = carrier[which.max(distance)],
              max_dist = max(distance)) |>
    mutate(month = month.abb[month]) |>
    kbl()
month carrier max_dist
Jan HA 4983
Feb HA 4983
Mar HA 4983
Apr HA 4983
May HA 4983
Jun HA 4983
Jul HA 4983
Aug HA 4983
Sep HA 4983
Oct HA 4983
Nov HA 4983
Dec HA 4983

It turns out that this is a recurring flight between the JFK airport in New York and the Honolulu airport in Hawaii.

flights |>
    dplyr::filter(distance == max(distance)) |>
    distinct(origin, dest) |>
    kbl()
origin dest
JFK HNL